In this notebook, we'll use IPython.parallel (IPP) and rpy2 as a quick-and-dirty way of parallelizing work in R. To demonstrate, we'll run a cluster of IPP engines on the same VM as the notebook server. We'll also need to install rpy2 before we can start.
!pip install rpy2
In [20]:
from IPython.html.services.clusters.clustermanager import ClusterManager
In [21]:
cm = ClusterManager()
We have to list the profiles before we can start anything, even if we know the profile name.
In [60]:
cm.list_profiles()
Out[60]:
For demo purposes, we'll just use the default profile, which starts a cluster on the local machine for us.
In [61]:
cm.start_cluster('default')
Out[61]:
After running the command above, we need to pause for a few moments to let all the workers come up. (Breathe and count 10 ... 9 ... 8 ...)
Now we can create a DirectView that can talk to all of the workers. (If you get an error, breathe, count some more, and try again in a few seconds.)
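If you'd rather not count on a fixed pause, a small polling loop can wait until the engines have registered. This is only a sketch, not part of the original workflow; the throwaway probe client and the 30-second limit are arbitrary choices, and it simply retries until the controller answers and at least one engine shows up.
In [ ]:
import time
import IPython.parallel

# Retry until the controller accepts connections and at least one
# engine has registered; give up after roughly 30 seconds.
for _ in range(30):
    try:
        probe = IPython.parallel.Client()
        if len(probe.ids) > 0:
            break
    except Exception:
        pass  # controller not ready yet
    time.sleep(1)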
In [27]:
import IPython.parallel
In [62]:
client = IPython.parallel.Client()
In [63]:
dv = client[:]
In my case, I have 8 CPUs, so I get 8 workers by default. Your number will likely differ.
In [72]:
len(dv)
Out[72]:
To ensure the workers are functioning, we can ask each one to run the bash command echo $$ to print a PID.
In [64]:
%%px
!echo $$
Before we can run R on the engines, we need to load the rpy2 IPython extension on every one of them.
In [ ]:
%%px
%load_ext rpy2.ipython
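As a quick sanity check (not in the original notebook), we can ask every engine to evaluate R.version.string through the %R line magic; each engine should answer with its R version.
In [ ]:
%%px
%R R.version.string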
Now we can tell every engine to run R code using the %%R (or %R) magic. Let's sample 50 random numbers from a normal distribution on each engine.
In [77]:
%%px
%%R
x <- rnorm(50)
summary(x)
Next, each engine pulls its copy of x from R into its Python namespace and converts it to a plain list so it can be gathered.
In [78]:
%%px
%Rpull x
x = list(x)
Back on the client, we gather the per-engine lists into one flat list.
In [79]:
x = dv.gather('x', block=True)
We should get 50 elements per engine.
In [80]:
assert len(x) == 50 * len(dv)
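The same scatter/compute/gather pattern is all we need to parallelize real R work. The sketch below is illustrative rather than part of the original notebook: the names n and m are made up, and the mean of a random sample is just a stand-in for whatever R function you actually want each engine to run on its chunk.
In [ ]:
# Give each engine its own piece of work (here, a sample size).
dv.scatter('n', [10000] * len(dv), block=True)
In [ ]:
%%px
n = n[0]  # scatter delivers a one-element list to each engine
In [ ]:
%%px
%%R -i n -o m
# each engine draws its own sample in R and computes a statistic
m <- mean(rnorm(n))
In [ ]:
# Collect one result per engine back on the client.
m = dv.gather('m', block=True)
Swap the body of the %%R cell for your own function to fan any embarrassingly parallel R job out across the engines.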
When we're finished, we shut the cluster down to free the workers.
In [81]:
cm.stop_cluster('default')
Out[81]: